Author: Fidel Ramírez
This tutorial explores the visualization possibilities of scanpy and is divided into three sections:
In this tutorial, we will use a dataset from 10x containing 68k cells from PBMC. Scanpy, includes in its distribution a reduced sample of this dataset consisting of only 700 cells and 765 highly variable genes. This dataset has been already preprocessed and UMAP computed.
In this tutorial, we will also use the following literature markers:
With scanpy, scatter plots for tSNE, UMAP and several other embeddings are readily available using the sc.pl.tsne, sc.pl.umap etc. functions. See here the list of options.
Those functions access the data stored in adata.obsm. For example sc.pl.umap uses the information stored in adata.obsm['X_umap']. For more flexibility, any key stored in adata.obsm can be used with the generic function sc.pl.embedding.
import scanpy as sc
import pandas as pd
from matplotlib.pyplot import rc_context
from random import sample
sc.set_figure_params(dpi=100, color_map = 'viridis_r')
sc.settings.verbosity = 1
sc.logging.print_header()
scanpy==1.7.2 anndata==0.7.6 umap==0.5.1 numpy==1.20.2 scipy==1.6.2 pandas==1.2.4 scikit-learn==0.24.1 statsmodels==0.12.2 python-igraph==0.9.1
results_file = 'write/LDAVM02PythonFiltered.h5ad' # the file that will store the analysis results; 'write' must be a folder
adata = sc.read(results_file)
# inspect pbmc contents
adata
AnnData object with n_obs × n_vars = 12338 × 1963
obs: 'Sample', 'Age', 'Condition', 'batch', 'SampleDescription', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden'
var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'SampleDescription_colors', 'dendrogram_leiden', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
For the scatter plots, the value to plot is given as the color argument. This can be any gene or any column in .obs, where .obs is a DataFrame containing the annotations per observation/cell, see AnnData for more information.
# rc_context is used for the figure size, in this case 4x4
with rc_context({'figure.figsize': (4, 4)}):
sc.pl.umap(adata, color='Ifitm3')
Multiple values can be given to color. In the following example we will plot 6 genes: 'CD79A', 'MS4A1', 'IGJ', CD3D', 'FCER1A', and 'FCGR3A' to get an idea on where those marker genes are being expressed.
Also, we will plot two other values: n_counts which is the number of UMI counts per cell (stored in .obs), and bulk_labels which is a categorical value containing the original labelling of the cells from 10X.
The number of plots per row is controlled using the ncols parameter. The maximum value plotted can be adjusted using vmax (similarly vmin can be used for the minimum value). In this case we use p99, which means to use as max value the 99 percentile. The max value can be a number or a list of numbers if the vmax wants to be set for multiple plots individually.
Also, we are using frameon=False to remove the boxes around the plots and s=50 to set the dot size.
with rc_context({'figure.figsize': (3, 3)}):
sc.pl.umap(adata, color=['Tmem119', 'Nrxn1', 'Spp1', 'Mki67', 'Ifitm3', 'total_counts','pct_counts_mt', 'leiden'], s=50, frameon=False, ncols=4, vmax='p99')
In this plot we can see the groups of cells that express the marker genes and the agreement with the original cell labels.
The functions for scatterplots have many options that allow fine tuning of the images. For example, we can look at the clustering as follows:
# compute clusters using the leiden method and store the results with the name `clusters`
sc.tl.leiden(adata, key_added='leiden', resolution=0.25)
#adata.obs['clusters'] = adata.obs['leiden']
with rc_context({'figure.figsize': (5, 5)}):
sc.pl.umap(adata, color='clusters', add_outline=True, legend_loc='on data',
legend_fontsize=12, legend_fontoutline=2,frameon=False,
title='clustering of cells', palette='Set1')
Frequently, clusters need to be labelled using well known marker genes. Using scatter plots we can see the expression of a gene and perhaps associate it with a cluster. Here, we will show other visual ways to associate marker genes to clusters using dotplots, violin plots, heatmaps and something that we call 'tracksplot'. All of these visualizations summarize the same information, expression split by cluster, and the selection of the best results is left to the investigator do decide.
First, we set up a dictionary with the marker genes, as this will allow scanpy to automatically label the groups of genes:
marker_genes_dict = {
'B-cell': ['CD79A', 'MS4A1'],
'Dendritic': ['FCER1A', 'CST3'],
'Monocytes': ['FCGR3A'],
'NK': ['GNLY', 'NKG7'],
'Other': ['IGLL1'],
'Plasma': ['IGJ'],
'T-cell': ['CD3D'],
}
marker_genes_dict = {
'Macrophage': ['Pf4', 'Lyve1'],
'Proliferating': ['Mki67', 'Top2a'],
'Mature Microglia': ['Tmem119','P2ry12'],
'IFN-I': ['Ifitm3', 'Stat1'],
'Lysosomal': ['Cd68','Lyz2'],
'Engulfing': ['Nrxn1','Adgrb1']
}
A quick way to check the expression of these genes per cluster is to using a dotplot. This type of plot summarizes two types of information: the color represents the mean expression within each of the categories (in this case in each cluster) and the dot size indicates the fraction of cells in the categories expressing a gene.
Also, it is also useful to add a dendrogram to the graph to bring together similar clusters. The hierarchical clustering is computed automatically using the correlation of the PCA components between the clusters.
sc.pl.dotplot(adata, marker_genes_dict, 'leiden', dendrogram=True)
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different. categories: 0, 1, 2, etc. var_group_labels: Macrophage, Proliferating, Mature Microglia, etc.
Using this plot, we can see that cluster 4 correspond to B-cells, cluster 2 is T-cells etc. This information can be used to manually annotate the cells as follows:
# create a dictionary to map cluster to annotation label
cluster2annotation = {
'0': 'Homeostatic',
'1': 'Proliferating',
'2': 'Proliferating',
'3': 'Engulfing',
'4': 'Proliferating',
'5': 'Immature',
'6': 'IFN-I',
'7': 'Macrophages'
}
# add a new `.obs` column called `cell type` by mapping clusters to annotation using pandas `map` function
adata.obs['cell type'] = adata.obs['leiden'].map(cluster2annotation).astype('category')
sc.pl.dotplot(adata, marker_genes_dict, 'clusters', dendrogram=True)
WARNING: dendrogram data not found (using key=dendrogram_clusters). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently. WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different. categories: homeostatic-0, pro-1, pro-2, etc. var_group_labels: Macrophage, Proliferating, Mature Microglia, etc.
sc.pl.umap(adata, color='cell type', legend_loc='on data',
frameon=False, legend_fontsize=10, legend_fontoutline=2)
A different way to explore the markers is with violin plots. Here we can see the expression of CD79A in clusters 5 and 8, and MS4A1 in cluster 5.Compared to a dotplot, the violin plot gives us and idea of the distribution of gene expression values across cells.
marker_genes = ['Tgfbr1','Hexb','P2ry12','Tmem119','Mki67','Top2a','Nrxn1','Igf1','Spp1','Ifitm3','Pf4']
sc.pl.stacked_violin(adata, marker_genes, groupby='clusters', rotation=90);
with rc_context({'figure.figsize': (6, 3)}):
sc.pl.violin(adata, ['Tmem119', 'Ifitm3','Hexb'], groupby='clusters',rotation = 90 )
Note Violin plots can also be used to plot any numerical value stored in .obs. For example, here violin plots are used to compare the number of genes and the percentage of mitochondrial genes between the different clusters.
with rc_context({'figure.figsize': (4.5, 3)}):
sc.pl.violin(adata, ['n_genes_by_counts', 'pct_counts_mt'], groupby='clusters', stripplot=False, inner='box',rotation = 90) # use stripplot=False to remove the internal dots, inner='box' adds a boxplot inside violins
To simultaneously look at the violin plots for all marker genes we use sc.pl.stacked_violin. As previously, a dendrogram was added to group similar clusters
ax = sc.pl.stacked_violin(adata, marker_genes_dict, groupby='clusters', swap_axes=False, dendrogram=True)
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different. categories: homeostatic-0, pro-1, pro-2, etc. var_group_labels: Macrophage, Proliferating, Mature Microglia, etc.
A simple way to visualize the expression of genes is with a matrix plot. This is a heatmap of the mean expression values per gene grouped by categories. This type plot basically shows the same information as the color in the dotplots.
Here, scale the expression of the genes from 0 to 1, being the maximum mean expression and 0 the minimum.
sc.pl.matrixplot(adata, marker_genes_dict, 'clusters', dendrogram=True, cmap='Blues', standard_scale='var', colorbar_title='column scaled\nexpression')
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different. categories: homeostatic-0, pro-1, pro-2, etc. var_group_labels: Macrophage, Proliferating, Mature Microglia, etc.
Other useful option is to normalize the gene expression using sc.pp.scale. Here we store this information under the scale layer. Afterwards we adjust the plot min and max and use a diverging color map (in this case RdBu_r where _r means reversed).
# scale and store results in layer
adata.layers['scaled'] = sc.pp.scale(adata, copy=True).X
adata.layers
Layers with keys: scaled
sc.pl.matrixplot(adata, marker_genes_dict, 'clusters', dendrogram=True, cmap='RdBu_r', standard_scale='var', colorbar_title='column scaled\nexpression')
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different. categories: homeostatic-0, pro-1, pro-2, etc. var_group_labels: Macrophage, Proliferating, Mature Microglia, etc.
An axis can be passed to a plot to combine multiple outputs as in the following example
import matplotlib.pyplot as plt
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20,4), gridspec_kw={'wspace':0.9})
ax1_dict = sc.pl.dotplot(adata, marker_genes_dict, groupby='clusters', ax=ax1, show=False)
ax2_dict = sc.pl.stacked_violin(adata, marker_genes_dict, groupby='clusters', ax=ax2, show=False)
ax3_dict = sc.pl.matrixplot(adata, marker_genes_dict, groupby='clusters', ax=ax3, show=False, cmap='viridis')
Heatmaps do not collapse cells as in previous plots. Instead, each cells is shown in a row (or column if swap_axes=True). The groupby information can be added and is shown using the same color code found for sc.pl.umap or any other embedding.
ax = sc.pl.heatmap(adata, marker_genes_dict, groupby='clusters', cmap='viridis', dendrogram=True)
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different. categories: homeostatic-0, pro-1, pro-2, etc. var_group_labels: Macrophage, Proliferating, Mature Microglia, etc.
The heatmap can also be plotted on scaled data. In the next image, similar to the previus matrixplot the min and max had been adjusted and a divergent color map is used.
ax = sc.pl.heatmap(pbmc, marker_genes_dict, groupby='clusters', layer='scaled', vmin=-2, vmax=2, cmap='RdBu_r', dendrogram=True, swap_axes=True, figsize=(11,4))
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different. categories: 0, 1, 2, etc. var_group_labels: B-cell, Dendritic, Monocytes, etc.
The track plot shows the same information as the heatmap, but, instead of a color scale, the gene expression is represented by height.
ax = sc.pl.tracksplot(adata, marker_genes_dict, groupby='clusters', dendrogram=True)
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different. categories: homeostatic-0, pro-1, pro-2, etc. var_group_labels: Macrophage, Proliferating, Mature Microglia, etc.
Instead of characterizing clusters by known gene markers as previously, we can identify genes that are differentially expressed in the clusters or groups.
To identify differentially expressed genes we run sc.tl.rank_genes_groups. This function will take each group of cells and compare the distribution of each gene in a group against the distribution in all other cells not in the group. Here, we will use the original cell labels given by 10x to identify marker genes for those cell types.
sc.tl.rank_genes_groups(adata, groupby='clusters', method='wilcoxon')
?sc.pp.highly_variable_genes
The dotplot visualization is useful to get an overview of the genes that show differential expression. To make the resulting image more compact we will use n_genes=4 to show only the top 4 scoring genes.
sc.pl.rank_genes_groups_dotplot(adata, n_genes=4)
In order to get a better representation we can plot log fold changes instead of gene expression. Also, we want to focus on genes that have a log fold change >= 3 between the cell type expression and the rest of cells.
In this case we set values_to_plot='logfoldchanges' and min_logfoldchange=3.
Because log fold change is a divergent scale we also adjust the min and max to be plotted and use a divergent color map. Notice in the following plot that is rather difficult to distinguish between T-cell populations.
sc.pl.rank_genes_groups_dotplot(adata, n_genes=4, values_to_plot='logfoldchanges', min_logfoldchange=1, vmax=7, vmin=-7, cmap='bwr')
Next, we use a dotplot focusing only on two groups (the groups option is also available for violin, heatmap and matrix plots). Here, we set n_genes=30 as in this case it will show all the genes that have a min_logfoldchange=4 up to 30.
sc.pl.rank_genes_groups_dotplot(adata, n_genes=30, values_to_plot='logfoldchanges', min_logfoldchange=3, vmax=7, vmin=-7, cmap='bwr', groups=['ifn1-6', 'neural-3'])
WARNING: Groups are not reordered because the `groupby` categories and the `var_group_labels` are different. categories: homeostatic-0, pro-1, pro-2, etc. var_group_labels: ifn1-6, neural-3
For the following plot the we use the previously computed 'scaled' values (stored in layer scaled) and use a divergent color map.
sc.pl.rank_genes_groups_matrixplot(adata, n_genes=3, use_raw=False, vmin=-3, vmax=3, cmap='bwr', layer='scaled')
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-74-f73ddc4a2716> in <module> ----> 1 sc.pl.rank_genes_groups_matrixplot(adata, n_genes=3, use_raw=False, vmin=-3, vmax=3, cmap='bwr', layer='scaled') /usr/local/lib/python3.9/site-packages/scanpy/plotting/_tools/__init__.py in rank_genes_groups_matrixplot(adata, groups, n_genes, groupby, values_to_plot, min_logfoldchange, key, show, save, return_fig, **kwds) 816 """ 817 --> 818 return _rank_genes_groups_plot( 819 adata, 820 plot_type='matrixplot', /usr/local/lib/python3.9/site-packages/scanpy/plotting/_tools/__init__.py in _rank_genes_groups_plot(adata, plot_type, groups, n_genes, groupby, values_to_plot, min_logfoldchange, key, show, save, return_fig, **kwds) 441 from .._matrixplot import matrixplot 442 --> 443 _pl = matrixplot( 444 adata, var_names, groupby, values_df=values_df, return_fig=True, **kwds 445 ) /usr/local/lib/python3.9/site-packages/scanpy/plotting/_matrixplot.py in matrixplot(adata, var_names, groupby, use_raw, log, num_categories, figsize, dendrogram, title, cmap, colorbar_title, gene_symbols, var_group_positions, var_group_labels, var_group_rotation, layer, standard_scale, values_df, swap_axes, show, save, ax, return_fig, **kwds) 326 """ 327 --> 328 mp = MatrixPlot( 329 adata, 330 var_names, /usr/local/lib/python3.9/site-packages/scanpy/plotting/_matrixplot.py in __init__(self, adata, var_names, groupby, use_raw, log, num_categories, categories_order, title, figsize, gene_symbols, var_group_positions, var_group_labels, var_group_rotation, layer, standard_scale, ax, values_df, **kwds) 100 **kwds, 101 ): --> 102 BasePlot.__init__( 103 self, 104 adata, /usr/local/lib/python3.9/site-packages/scanpy/plotting/_baseplot_class.py in __init__(self, adata, var_names, groupby, use_raw, log, num_categories, categories_order, title, figsize, gene_symbols, var_group_positions, var_group_labels, var_group_rotation, layer, ax, **kwds) 103 self._update_var_groups() 104 --> 105 self.categories, self.obs_tidy = _prepare_dataframe( 106 adata, 107 self.var_names, /usr/local/lib/python3.9/site-packages/scanpy/plotting/_anndata.py in _prepare_dataframe(adata, var_names, groupby, use_raw, log, num_categories, layer, gene_symbols) 1843 groupby.remove(groupby_index) 1844 keys = list(groupby) + list(np.unique(var_names)) -> 1845 obs_tidy = get.obs_df( 1846 adata, keys=keys, layer=layer, use_raw=use_raw, gene_symbols=gene_symbols 1847 ) /usr/local/lib/python3.9/site-packages/scanpy/get/get.py in obs_df(adata, keys, obsm_keys, layer, gene_symbols, use_raw) 270 alias_index = None 271 --> 272 obs_cols, var_idx_keys, var_symbols = _check_indices( 273 adata.obs, 274 var.index, /usr/local/lib/python3.9/site-packages/scanpy/get/get.py in _check_indices(dim_df, alt_index, dim, keys, alias_index, use_raw) 165 not_found.append(key) 166 if len(not_found) > 0: --> 167 raise KeyError( 168 f"Could not find keys '{not_found}' in columns of `adata.{dim}` or in" 169 f" {alt_repr}.{alt_search_repr}." KeyError: "Could not find keys '['C1qc', 'Hmgb1', 'Lgmn']' in columns of `adata.obs` or in adata.var_names."
sc.pl.rank_genes_groups_stacked_violin(adata, n_genes=3, cmap='viridis_r')
#make a function that will save adata with only 50 cells per group
groups = adata.obs['clusters'].unique()
cells = []
for group in groups:
c = adata.obs_names[adata.obs['clusters'] == group]
c = sample(sorted(c),100)
cells = cells+c
print(adata)
bdata = adata[cells,]
bdata
AnnData object with n_obs × n_vars = 12338 × 1963
obs: 'Sample', 'Age', 'Condition', 'batch', 'SampleDescription', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'clusters', 'cell type'
var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'SampleDescription_colors', 'dendrogram_leiden', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap', 'clusters_colors', 'dendrogram_cell type', 'dendrogram_clusters', 'cell type_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'scaled'
obsp: 'connectivities', 'distances'
View of AnnData object with n_obs × n_vars = 800 × 1963
obs: 'Sample', 'Age', 'Condition', 'batch', 'SampleDescription', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'leiden', 'clusters', 'cell type'
var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'SampleDescription_colors', 'dendrogram_leiden', 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap', 'clusters_colors', 'dendrogram_cell type', 'dendrogram_clusters', 'cell type_colors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'scaled'
obsp: 'connectivities', 'distances'
sc.pl.rank_genes_groups_heatmap(bdata, n_genes=3, use_raw=True, swap_axes=True, vmin=0, vmax=3, cmap='bwr', figsize=(10,7), show=False);
Showing 10 genes per category, turning the gene labels off and swapping the axes. Notice that when the image is swapped, a color code for the categories appear instead of the 'brackets'.
sc.pl.rank_genes_groups_heatmap(bdata, n_genes=10, use_raw=True, swap_axes=True, show_gene_labels=True,
vmin=0, vmax=3, cmap='bwr')
sc.pl.rank_genes_groups_tracksplot(bdata, n_genes=3)
In scanpy, is very easy to compare marker genes using split violin plots for all groups at once.
with rc_context({'figure.figsize': (9, 1.5)}):
sc.pl.rank_genes_groups_violin(adata, n_genes=20, jitter=False)
Most of the visualizations can arrange the categories using a dendrogram. However, the dendrogram can also be plotted independently as follows:
# compute hierarchical clustering using PCs (several distance metrics and linkage methods are available).
sc.tl.dendrogram(adata, 'clusters')
ax = sc.pl.dendrogram(adata, 'clusters')
Together with the dendrogram it is possible to plot the correlation (by default 'pearson') of the categories.
ax = sc.pl.correlation_matrix(adata, 'clusters', figsize=(5,3.5))